Conversation
Thanks for diving in to add the fetch functionality @kba. I wonder if it might be a bit more readable to rename

Also, I haven't worked with fetch.txt files much before. But I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory. See: https://github.com/LibraryOfCongress/bagit-python/blob/master/test.py#L1023
Fine with me, I wanted to avoid confusion with
Sure.
These tests break the "Every file listed in the fetch file MUST be listed in every payload manifest" rule and it isn't validated. Since the manifests determine the number and size of files, it could make sense to allow "bag with holes" validation against only the files not mentioned in fetch.txt via a special parameter, if you don't want to fetch the whole thing. By default,
should not be valid, you're right.
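The rule being discussed could be sketched as a standalone check; the function name validate_fetch_entries_in_manifests and the data shapes here are illustrative, not part of the bagit-python API:

```python
def validate_fetch_entries_in_manifests(fetch_entries, payload_manifests):
    """Enforce the BagIt rule that every fetch.txt entry must appear in
    every payload manifest.

    fetch_entries: iterable of (url, length, filename) tuples.
    payload_manifests: dict mapping algorithm name -> {filename: checksum}.
    """
    for _, _, filename in fetch_entries:
        for algorithm, entries in payload_manifests.items():
            if filename not in entries:
                raise ValueError(
                    "%s is listed in fetch.txt but missing from the %s manifest"
                    % (filename, algorithm))


# A fetch entry whose filename is present in every manifest passes;
# one that is missing from any manifest raises ValueError.
manifests = {"sha256": {"data/a.txt": "abc123"}}
validate_fetch_entries_in_manifests(
    [("http://example.com/a.txt", "10", "data/a.txt")], manifests)
```

With a check like this in place, "bag with holes" validation would simply skip the size/count comparison for the filenames yielded by the fetch entries.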
I guess we could consider validation as a separate issue from this PR though.
By default only allow HTTP(S) and (S)FTP URLs
self.normalized_manifest_names = {}

self.algorithms = []
self.fetch_url_whitelist = DEFAULT_FETCH_URL_WHITELIST if fetch_url_whitelist is None else fetch_url_whitelist
This could be simplified:
- self.fetch_url_whitelist = DEFAULT_FETCH_URL_WHITELIST if fetch_url_whitelist is None else fetch_url_whitelist
+ self.fetch_url_whitelist = fetch_url_whitelist or DEFAULT_FETCH_URL_WHITELIST
for url, expected_size, filename in self.fetch_entries():
    if not fnmatch_any(url, self.fetch_url_whitelist):
        raise BagError(_("Malformed URL in fetch.txt: %s, matches none of the whitelisted URL patterns %s") % (url, self.fetch_url_whitelist))
    expected_size = -1 if expected_size == '-' else int(expected_size)
I'm thinking this would be more idiomatic as:
- expected_size = -1 if expected_size == '-' else int(expected_size)
+ expected_size = None if expected_size == '-' else int(expected_size)
I also wanted to check whether fetch_entries() is correctly validating that these values are either “-” or digits.
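The validation being asked about could look something like this; parse_fetch_length is a hypothetical helper (not bagit-python API) that accepts only "-" or a run of digits for the LENGTH field, combined with the None-for-unknown suggestion above:

```python
import re


def parse_fetch_length(raw):
    """Parse the LENGTH field of a fetch.txt line.

    Per the BagIt spec the field is either "-" (size unknown) or a
    non-negative integer; anything else is malformed.
    """
    if raw == "-":
        return None  # unknown size, as suggested above
    if re.fullmatch(r"\d+", raw):
        return int(raw)
    raise ValueError("Malformed LENGTH in fetch.txt: %r" % raw)


print(parse_fetch_length("-"))     # None
print(parse_fetch_length("1024"))  # 1024
```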
for _, _, filename in self.fetch_entries():
    yield filename

def fetch(self, force=False):
I'm wondering whether we should support force or simply have the user handle file operations outside of this scope. The main thing force does is open up potential security concerns, so perhaps we should just confirm that we correctly validate that file paths cannot point outside of the bag.
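The path check mentioned above could be sketched like this; is_inside_bag is a hypothetical helper name, not an existing bagit-python function:

```python
import os


def is_inside_bag(bag_dir, filename):
    """Return True only if filename, resolved relative to bag_dir,
    stays inside the bag directory (rejects ../ traversal and
    absolute paths that escape the bag)."""
    bag_dir = os.path.abspath(bag_dir)
    target = os.path.abspath(os.path.join(bag_dir, filename))
    return target.startswith(bag_dir + os.sep)


print(is_inside_bag("/tmp/bag", "data/payload.bin"))  # True
print(is_inside_bag("/tmp/bag", "../../etc/passwd"))  # False
```

Running a check like this over every filename from fetch.txt before writing would address the concern regardless of whether force is kept.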
else:
    content_length = int(headers['content-length'])
    if content_length != expected_size:
        raise BagError(_("Inconsistent size of %s: Expected %s but Content-Length is %s") % (filename, expected_size, content_length))
I was wondering whether this should have the URL as well:
- raise BagError(_("Inconsistent size of %s: Expected %s but Content-Length is %s") % (filename, expected_size, content_length))
+ raise BagError(_("Inconsistent size for %s: expected %s bytes but %s had a Content-Length of %s") % (filename, expected_size, url, content_length))
resp = opener.open(req)
headers = resp.info()
if expected_size >= 0:
    if "content-length" not in headers:
This would only apply to HTTP requests but the intention of this code is to support other protocols, which suggests that this would need to be a conditional warning since I don't believe the stdlib sets a header like that for other protocols.
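The conditional behavior suggested here could be sketched as follows; check_size is a hypothetical helper that simplifies the actual fetch code, treating a missing Content-Length as fatal only for HTTP(S) and as a warning for other schemes:

```python
import warnings


def check_size(scheme, headers, expected_size, filename):
    """Compare an expected size against a response's content-length.

    expected_size of None means the fetch.txt line gave no size.
    Only HTTP(S) responses are required to carry Content-Length;
    other protocols just produce a warning when it is absent.
    """
    if expected_size is None:
        return
    content_length = headers.get("content-length")
    if content_length is None:
        if scheme in ("http", "https"):
            raise ValueError("No Content-Length for %s" % filename)
        warnings.warn("Cannot verify size of %s over %s" % (filename, scheme))
    elif int(content_length) != expected_size:
        raise ValueError("Inconsistent size of %s" % filename)


check_size("http", {"content-length": "10"}, 10, "data/a.bin")  # OK
```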
parsed_url = urlparse(url)

if not all((parsed_url.scheme, parsed_url.netloc)):
    raise BagError(_("Malformed URL in fetch.txt: %s") % url)
I think we should still have this validation step since it reports some malformed URLs earlier. This would also allow us to remove both of the fnmatch checks by changing the whitelist to be a list of URL schemes and simply adding a second if parsed_url.scheme not in self.fetch_url_schema_whitelist: … check here.
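Combining both checks as proposed might look like this; the whitelist contents and the function name validate_fetch_url are illustrative (the attribute name fetch_url_schema_whitelist comes from the comment above, not from existing code):

```python
from urllib.parse import urlparse

# Illustrative default: the HTTP(S) and (S)FTP schemes mentioned in this PR.
DEFAULT_SCHEME_WHITELIST = ("http", "https", "ftp", "sftp")


def validate_fetch_url(url, scheme_whitelist=DEFAULT_SCHEME_WHITELIST):
    """Reject structurally malformed URLs, then reject any URL whose
    scheme is not whitelisted -- replacing the fnmatch pattern checks."""
    parsed = urlparse(url)
    if not all((parsed.scheme, parsed.netloc)):
        raise ValueError("Malformed URL in fetch.txt: %s" % url)
    if parsed.scheme not in scheme_whitelist:
        raise ValueError("URL scheme %r not in whitelist %s"
                         % (parsed.scheme, scheme_whitelist))


validate_fetch_url("https://example.com/data/file.bin")  # passes silently
```

A scheme list is also easier to document and audit than glob patterns, since there is no way to accidentally write an over-broad pattern.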
Adds a new method Bag.fetch_files_to_be_fetched() that fetches files listed in fetch.txt, c.f. #118. If this is useful for someone, it can be further refined (CLI, overrideable fetch implementation, anti-hammering interval).